This repository has been archived by the owner on May 3, 2022. It is now read-only.

Better reporting for stuck Deployment #256

Merged
merged 2 commits into master from jgreff/deployment-conditions on Jan 22, 2020

Conversation

juliogreff (Contributor) commented:

This builds on top of #242.

One of the bigger challenges for shipper users is to decide if a rollout
is waiting for the right bits to be flipped in the Kubernetes cluster,
or if something went wrong with no hope of being fixed without
intervention. Although this commit does not fundamentally fix that (as
it would be very involved and error prone, requiring the capacity
controller to have intimate knowledge of replica sets and pods, which we
want to avoid), we're now checking for a few more conditions in the
Deployment that surface known situations (sketched in Go after the
list):

  • A Deployment has just been changed, and its Status does not yet
    reflect the brand new Spec. In that case, we consider capacity to be
    in progress.

  • A Deployment times out. This is not super common, as it requires users
    to define a progress deadline (progressDeadlineSeconds) in the
    Deployment itself. If that ever happens, though, we're covered :)

  • A Deployment would violate quotas, or would otherwise cause the
    ReplicaSet to be in a state of error. That's insanely common, and also
    super hard to diagnose without knowing that this is an error condition
    to begin with.
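For illustration, here is a minimal sketch of what those three checks can
look like against the apps/v1 Deployment API. The package name, the
deploymentState type, and the classifyDeployment helper are hypothetical
names for this sketch, not shipper's actual code; the condition types and
the ProgressDeadlineExceeded reason are the standard Kubernetes ones.

    package capacity

    import (
        appsv1 "k8s.io/api/apps/v1"
        corev1 "k8s.io/api/core/v1"
    )

    // deploymentState is a hypothetical summary of where a Deployment is.
    type deploymentState int

    const (
        stateInProgress deploymentState = iota
        stateTimedOut
        stateFailed
        stateReady
    )

    // getCondition returns the Deployment condition of the given type, if any.
    func getCondition(d *appsv1.Deployment, t appsv1.DeploymentConditionType) *appsv1.DeploymentCondition {
        for i := range d.Status.Conditions {
            if d.Status.Conditions[i].Type == t {
                return &d.Status.Conditions[i]
            }
        }
        return nil
    }

    // classifyDeployment mirrors the three checks described above.
    func classifyDeployment(d *appsv1.Deployment) deploymentState {
        // 1. The Spec was just changed and the deployment controller has
        // not observed it yet: capacity is simply still in progress.
        if d.Generation > d.Status.ObservedGeneration {
            return stateInProgress
        }
        // 2. progressDeadlineSeconds was exceeded: the Progressing
        // condition flips to False with reason ProgressDeadlineExceeded.
        if c := getCondition(d, appsv1.DeploymentProgressing); c != nil &&
            c.Status == corev1.ConditionFalse && c.Reason == "ProgressDeadlineExceeded" {
            return stateTimedOut
        }
        // 3. The ReplicaSet cannot create pods (e.g. a quota violation):
        // the deployment controller surfaces this as ReplicaFailure.
        if c := getCondition(d, appsv1.DeploymentReplicaFailure); c != nil &&
            c.Status == corev1.ConditionTrue {
            return stateFailed
        }
        return stateReady
    }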

@juliogreff juliogreff added the enhancement New feature or request label Jan 8, 2020
@juliogreff juliogreff self-assigned this Jan 8, 2020
 	return nil, err
 }

-	patchString := fmt.Sprintf(`{"spec": {"replicas": %d}}`, replicaCount)
+	patch := []byte(fmt.Sprintf(`{"spec": {"replicas": %d}}`, replicaCount))
osdrv (Contributor) commented:

Shall we introduce some nice-looking abstraction maybe?
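One way to read that suggestion is to build the patch through the type
system instead of hand-rolled string formatting. A hedged sketch, where
replicasPatch is a hypothetical helper and not shipper's actual API:

    import "encoding/json"

    // replicasPatch builds the same strategic merge patch as the
    // fmt.Sprintf above; the anonymous struct marshals to
    // {"spec":{"replicas":N}}, with the encoder handling quoting.
    func replicasPatch(replicaCount int32) ([]byte, error) {
        patch := struct {
            Spec struct {
                Replicas int32 `json:"replicas"`
            } `json:"spec"`
        }{}
        patch.Spec.Replicas = replicaCount
        return json.Marshal(patch)
    }

The call site would then become patch, err := replicasPatch(replicaCount),
which keeps the JSON shape in one place if more fields ever need patching.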

osdrv previously approved these changes Jan 20, 2020
Most of these checks have been moved into the controllers themselves, and
while that's not necessarily ideal, it's much better than a random "utils"
package.
@hihilla hihilla merged commit 7fa36c2 into master Jan 22, 2020
@hihilla hihilla deleted the jgreff/deployment-conditions branch January 22, 2020 14:20